Skip to content

Conversation

@86xsk
Copy link
Contributor

@86xsk 86xsk commented Jan 30, 2026

Issues

Fixes #2585
Related to #1688
Related to #2411

(Probably a few more, this has been a long-running issue.)

Description

  • Changes the dictionary API to allow more explicitly handling capitalization/orthographic word variations.
    • Replaces WordId with CanonicalWordId and CaseFoldedWordId.
      • CanonicalWordId hashes the input word as is, without lowercasing or normalization.
      • CaseFoldedWordId lowercases and normalizes the word before hashing it (this is the current behavior of WordId).
      • Adds WordIdPair and EitherWordId to make it easier to work with CanonicalWordId/CaseFoldedWordId.
    • Querying the dictionary case-insensitively now means you will get a Vec of words, not an individual word.
      • However, the current behavior, where a word is queried case-insensitively and returns the merged metadata across all of its casing/orthographic variants is preserved in the get_word_metadata_combined functions.
    • Querying the dictionary case-sensitively will now return a specific entry in the curated dictionary, allowing differentiation for case/orthographic variants when desired.
  • Fixes the issue where some words would be linted by SpellCheck despite being present in the dictionary.
  • Makes SpellCheck no longer care about capitalization, fully transferring that responsibility to OrthographicConsistency (which already handles it anyway for the most part).
  • Re-adds "Pr" to the curated dictionary, since it no longer conflicts with "PR".

How Has This Been Tested?

  • cargo test
  • Manually building and testing the modified harper-ls on a few Markdown files

Checklist

  • I have performed a self-review of my own code
  • I have added tests to cover my changes

86xsk added 19 commits January 20, 2026 01:12
`SpellCheck` shouldn't handle capitalization if
`OrthographicConsistency` is going to do it anyway.
- Splits `WordId` into `CanonicalWordId` and `CaseFoldedWordId`.

- Updates dictionary functions to more explicitly handle casing. There
  are now functions to get a specific word case-sensitively, multiple
  words case-insensitively, and ditto but merge all metadata (which was
  the old behavior).

- Fixes issue where `SpellCheck` would sometimes mark words as incorrect
  if an identical entry with different casing existed in the dictionary
  (e.g. OS, PR, etc.).

- Makes `SpellCheck` no longer care about casing, since that is handled
  by `OrthographicConsistency`.
…mattic#2476)"

This partially reverts commit 5230d6a.

Returns the word to the dictionary, since removing it should no longer
be necessary.
Since casing-related issues are now handled by
`OrthographicConsistency`, not `SpellCheck`.
Expands the criteria in which `OrthographicConsistency` will lint for
incorrect capitalization.

Makes the `no_improper_suggestion_for_macos` test pass.
Allow all word casing/orthography that are defined in the dictionary. If
the dictionary contains the exact word, `OrthographicConsistency` will
skip it.
Both variants are defined in the dictionary, and appear to be valid in
this case.
The test would expect 'Al' to be linted by `OrthographicConsistency` for
not being all-caps.
Remove needless borrow.
@hippietrail
Copy link
Collaborator

hippietrail commented Jan 30, 2026

Does it handle the same word having multiple WordIds? For instance you might have dictionary entries like:

- unpack/Vd # also creates "unpacked"
- packed/VtTU # also creates "unpacked
- unpacked/J # specifically creates "unpacked"

This happens a lot more than expected and causes some bugs of this type to crop up. I had a PR for it for many months that just tracked them but didn't really do anything with them as I wasn't sure what was and wasn't planned etc.

@86xsk
Copy link
Contributor Author

86xsk commented Jan 30, 2026

I believe it will merge the metadata for words that have identical canonical word IDs, as seems to be the case here.

image

The code in expand_annotated_word adds or appends metadata based on the canonical word ID (i.e. the hash of the word exactly as stated in the dictionary or as expanded from another word, without lowercasing or normalization). "unpacked" will always have the same word ID regardless of whether it was specifically defined or came from an expansion, so all instances/manifestations of "unpacked" will have their metadata merged.

If in theory we had, say, a proper noun "Unpacked" (whether specifically defined or coming from an expansion), that would not be merged. It would remain a separate entry that can be queried with the appropriate dictionary functions.

However, tokens that originate from a Document still currently use the merged/combined metadata. So an "unpacked" token would have both the "unpacked" and "Unpacked" metadata merged.

// In Document::parse()
// Annotate DictWord metadata
for token in sent.iter_mut() {
    if let TokenKind::Word(meta) = &mut token.kind {
        let word_source = token.span.get_content(&self.source);
        let mut found_meta = dictionary
            .get_word_metadata_combined(word_source) // <---
            .map(|c| c.into_owned());

        if let Some(inner) = &mut found_meta {
            inner.pos_tag = token_tags[i].or_else(|| inner.infer_pos_tag());
            inner.np_member = Some(np_flags[i]);
        }

        *meta = found_meta;
        i += 1;
    } else if !token.kind.is_whitespace() {
        i += 1;
    }
}

@hippietrail
Copy link
Collaborator

hippietrail commented Jan 31, 2026

I don't recall everything perfectly at the moment but both I think both cases are possible and only one is handled:

  1. One WordID can refer to multiple words - due to case folding
  2. One word can have multiple WordIDs

I believe you're talking about 1. which works but is behind a few bugs. But 2. is distinct. I found my old PR: #1035

Apologies if I've mixed up any concepts myself - it's been a while (-:

@86xsk
Copy link
Contributor Author

86xsk commented Jan 31, 2026

Oh, I see what you mean now. Yeah no, at the moment I think this PR mostly sticks with the old behavior where once a DictWordMetadata has its one and onlyderived_from set, it's not updated again.

I did change the field to store a WordIdPair (storing both a canonical and case-folded word ID for its sole "parent" word), but I don't think that helps too much in this case. If I recall correctly, I did this so I could mimic the previous case-folded lookup in the DerivedFrom pattern, while still storing a canonical parent word ID too.

I'll take another look though. I wasn't too certain with that change as is, and the PR you mentioned does bring up an issue I haven't even considered.

@86xsk 86xsk marked this pull request as draft January 31, 2026 17:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

"OS" flagged even though it's in dictionary.dict

2 participants